Clipping Outliers: NYC hotel pricing dataset analysis

Sometimes, while analyzing a dataset, a few data points exert undue influence when building models such as linear regression. These data points are called outliers. Outliers can distort summary statistics of the data and degrade model performance.

First, let's go over some basics about outliers.

In data science, outliers are values within a dataset that differ greatly from the others: they are either much larger or much smaller. Outliers can appear in a dataset due to measurement variability, data errors, experimental error, and so on. When included in the training data, outliers can cause machine learning models to make inaccurate predictions, so they need to be handled before training a model.

One of the best ways to understand outliers is with box plots.

Boxplots are very useful for seeing the distribution of a variable/feature and detecting outliers in it. A boxplot is a graphical representation that describes the behavior of the data in the middle as well as at both ends of the distribution. It shows the data based on the five-number summary:

  • Minimum: the lowest data point in a variable excluding any outliers
  • Median (Q2 or 50th percentile): the middle value in the variable
  • First quartile (Q1 or 25th percentile): also known as the lower quartile (0.25)
  • Third quartile (Q3 or 75th percentile): also known as the upper quartile (0.75)
  • Maximum: the highest data point in the variable excluding any outliers

The difference between the upper quartile and the lower quartile (Q3 - Q1) is called the interquartile range, or IQR.

Boxplots help us find the outliers in the data by using the IQR. As a rule, values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are regarded as outliers. The image below will help us better understand the outliers in our data.

[Image: box plot]

In the image above, the points that are outside the whisker lines are the outliers.
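As a rough illustration of the 1.5*IQR rule, the sketch below flags outliers in a small made-up list. The values are hypothetical and are not taken from the NYC dataset. It uses Python's standard statistics module, where the "inclusive" method interpolates linearly between data points.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
values = [10, 12, 13, 15, 16, 18, 20, 22, 95]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Limits of the non-outlier range according to the 1.5*IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("non-outlier range:", lower_fence, "to", upper_fence)
print("outliers:", outliers)
```

Only the value 95 falls outside the computed range, so it is the only point the rule marks as an outlier here.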

There are different techniques to handle outliers in a dataset. In our example, we will use the concept of clipping (winsorizing).

Clipping a dataset means capping its values at the last permitted extreme value, e.g. the 5th or 95th percentile value. For example, when we clip the data at the 95th percentile, every value above the 95th percentile is set to the 95th percentile value.

The following data set has several extremes (0.1, 1, 99 and 125):

  • {0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 99, 125}

After clipping/winsorizing the top and bottom 10% of the data (setting those values to the nearest remaining extreme), we get:

  • {12, 12, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 44, 44}
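The example above can be reproduced with a few lines of plain Python. This is a minimal sketch that replaces a fixed number of values at each end with the nearest value that is kept; the variable names are illustrative, and it is not the percentile interpolation that a library function might use.

```python
# The example data set from the text above
data = [0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26,
        29, 32, 33, 35, 39, 40, 41, 44, 99, 125]

fraction = 0.10                    # share of values to winsorize at each end
k = round(fraction * len(data))    # number of values replaced per end (2 here)

ordered = sorted(data)
low = ordered[k]                   # smallest value that is kept as-is
high = ordered[-k - 1]             # largest value that is kept as-is

# Values below `low` become `low`, values above `high` become `high`
winsorized = [min(max(x, low), high) for x in data]
print(winsorized)
```

With 20 values and a 10% share, two values at each end are replaced: 0.1 and 1 become 12, while 99 and 125 become 44.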

Let us solve a problem that replaces outliers from data using clipping.



Problem Description

We have a dataset named nyc_airbnb.csv, which contains the per-night prices of Airbnb listings in New York City. In the dataset, we want to analyze the price feature. Before the analysis, we want to make sure there are no outliers in the price data. If there are any outliers, we want to replace them using the winsorizing/clipping method.

First, we load our dataset into a dataframe and view it.

Load the Dataset and View Data

Step 1: import the pandas library as pd

In [1]:
import pandas as pd

Step 2: Load the data into a dataframe nyc using read_csv method in pandas

In [2]:
nyc = pd.read_csv("../datasets/nyc_airbnb.csv")

Step 3: View the data stored in dataframe nyc

In [3]:
nyc
Out[3]:
 | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9
48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36
48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27
48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2
48894 | 36487245 | Trendy duplex in the very heart of Hell's Kitchen | 68119814 | Christophe | Manhattan | Hell's Kitchen | 40.76404 | -73.98933 | Private room | 90 | 7 | 0 | NaN | NaN | 1 | 23

48895 rows × 16 columns

Check for outliers in price data:

Since we want to find out whether outliers exist in the price data, an effective way of detecting them is to use visualizations. A strip plot is a very useful graph for seeing how the datapoints of a feature/variable are spread.

Strip Plot Information

A strip plot is a single-axis scatter plot that is used to visualise the distribution of many individual one-dimensional values. The values are plotted as dots along one unique axis, and the dots with the same value can overlap. It is useful for observing variability, clustering, and outliers in small datasets.

[Image: strip plot]

For our strip plot, we visualize every datapoint of the price data. We look at the spread of price data in the y axis. For this plot, we import the plotly express library.

Step 1: First, import the plotly.express library as px

In [1]:
import plotly.express as px
                

Step 2: Call the strip() method and store it in variable price_strip

In [9]:
price_strip = px.strip(nyc, y='price')

Explanation: Using px, call the strip() method with the following parameters:

  • nyc: the variable where the data is stored
  • y='price': the data column/feature to plot on the y axis

The resulting plot is stored in the variable price_strip.

Step 3: Display the variable price_strip using the show() method

In [9]:
price_strip.show()

From the strip plot, we can see that the data ranges from 0 to 10,000. The majority of the datapoints are within the range 0-2000 and a few of them are over 4000. We can suspect that there are outliers in the price data, most likely among the points above 4000. But we need to determine the exact outlier limits before applying clipping to the price data.

Make a Box Plot for Outlier Estimation

To find the exact outlier points, boxplots are very useful. A boxplot lets us visualize the exact outlier points in the data.

Let us plot a boxplot on the price data and confirm the outlier points.

Box Plot Information

A boxplot (or box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary:

  • Minimum: The smallest data point (excluding outliers).
  • First Quartile (Q1): The median of the lower half of the dataset (25th percentile).
  • Median (Q2): The middle value of the dataset (50th percentile).
  • Third Quartile (Q3): The median of the upper half of the dataset (75th percentile).
  • Maximum: The largest data point (excluding outliers).

Key Components:

  • Box: Represents the interquartile range (IQR) — the middle 50% of the data (from Q1 to Q3)
  • Median Line: A line inside the box that shows the median of the data.
  • Whiskers: Extend from Q1 and Q3 to the minimum and maximum values within 1.5 × IQR.
  • Outliers: Data points more than 1.5 × IQR below Q1 or above Q3 are plotted as individual points.

[Image: box plot]
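One detail in the whisker definition above is easy to miss: a whisker ends at the most extreme data point that lies within 1.5 × IQR of the box, not at the 1.5 × IQR limit itself. The sketch below illustrates this with a hypothetical helper, whisker_ends, and made-up sample values; it describes the convention stated above rather than the internals of any plotting library.

```python
import statistics

def whisker_ends(values, k=1.5):
    """Return the lowest and highest data points within k * IQR of the box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inside = [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]
    return min(inside), max(inside)

# Hypothetical sample values, for illustration only
sample = [2, 4, 5, 7, 8, 9, 11, 12, 40]
low_end, high_end = whisker_ends(sample)
print(low_end, high_end)
```

For this sample, Q1 is 5 and Q3 is 11, so the upper limit is 11 + 1.5 × 6 = 20. The upper whisker still stops at 12, the largest data point below that limit, and 40 would be drawn as an individual outlier point.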

Step 1: First, import the seaborn library as sns

In [1]:
import seaborn as sns
                  
%matplotlib inline

Step 2: Call the boxplot() method using the sns library.

In [9]:
sns.boxplot(data = nyc, x = 'price', width = 0.4)

Explanation: Using sns, call the boxplot() method with the following parameters:

  • data: the source data, nyc in this case
  • x: the feature to plot on the x axis (horizontal view), price in this case
  • width: the width of the box

Out[12]:
<Axes: xlabel='price'>
[Image: box plot of price]

From the boxplot, we can see that a large number of datapoints lie outside the whisker lines on the right side; these are the outliers. The valid range of values (those that are not outliers) lies within the whiskers, between Q1 - 1.5*IQR and Q3 + 1.5*IQR. The datapoints outside this range are the outlier points.

Our task now is to find these limit points so that we can replace the outliers using winsorizing/clipping.

Use describe() for price distribution:

To find the exact non-outlier points range in price, we need to see the numerical distribution of price in the five number summary. This will give us a better understanding of where the outliers lie. We will use the describe method to look at the distribution of price data.

Step: Use the describe() method on the price data. This will show the price data distribution on the five number summary

In [9]:
nyc['price'].describe()

Explanation: Select the price feature from the dataframe nyc and apply the describe() method to it.

Out[8]:
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

Here, in the price distribution, we can see the min, 25%, 50%, 75% and max values of price. The price values range from 0 (min) to 10000 (max), and the 75th percentile value of price is 175. This means 75% of the price values are within 0-175, while the remaining 25% are spread over the much wider range 175-10000. Our outliers lie in this upper region of values.

Find the Outliers:

Using the quantile() method, we find the interquartile range (IQR) along with the lower and upper limits for outlier points. For that, we first find the q3 and q1 points in the price data.

Interquartile Range


Q3 Calculation Information:

The q3 point is the data point that represents the 75th percentile value in a variable (price data in our case).

Step 1: Select the 75th percentile value from the price data in nyc using the quantile() method. Store the result in variable q3.

In [9]:
q3 = nyc['price'].quantile(0.75)
  • nyc['price']
    Explanation: Select the price column from the variable nyc
  • .quantile(0.75)
    Explanation: Apply the quantile() method on price with 0.75 as the parameter to get the 75th percentile value
  • q3 =
    Explanation: Store the result in the variable q3

Step 2: Print the variable q3

In [10]:
print("q3:",q3)
                  
q3: 175.0

Q1 Calculation Information:

The q1 point is the data point that represents the 25th percentile value in a variable (price data in our case).

Step 1: Select the 25th percentile value from the price data in nyc using the quantile() method. Store the result in variable q1.

In [9]:
q1 = nyc['price'].quantile(0.25)
  • nyc['price']
    Explanation: Select the price column/feature from the variable nyc
  • .quantile(0.25)
    Explanation: Apply the quantile() method on price with 0.25 as the parameter to get the 25th percentile value
  • q1 =
    Explanation: Store the result in the variable q1

Step 2: Print the variable q1.

In [12]:
print("q1:",q1)
q1: 69.0
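It can help to know how quantile() arrives at a value when the requested percentile falls between two data points. By default, pandas interpolates linearly between the neighbouring values. The sketch below uses a hypothetical four-value Series (not the NYC data) to show this.

```python
import pandas as pd

# Hypothetical toy values, used only to show how quantile() behaves
toy = pd.Series([1, 2, 3, 4])

# Position of the 25th percentile: 0.25 * (4 - 1) = 0.75, between 1 and 2
toy_q1 = toy.quantile(0.25)

# Position of the 75th percentile: 0.75 * (4 - 1) = 2.25, between 3 and 4
toy_q3 = toy.quantile(0.75)

print("toy q1:", toy_q1)
print("toy q3:", toy_q3)
```

Because of this interpolation, a quartile does not have to be a value that actually occurs in the data. In the price data, q1 and q3 happen to come out as the whole numbers 69.0 and 175.0.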

Find the Interquartile Range (IQR):

After calculating q1 and q3, we find the interquartile range, which is the difference between these two points. This range will determine the upper and lower bounds beyond which points in our price data count as outliers.

Step 1: Subtract q1 from q3 and store the result in a variable iqr

In [13]:
iqr = q3 - q1

Step 2: Print the variable iqr.

In [14]:
print("iqr:",iqr)
iqr: 106.0

Calculate the upper and lower bound for outliers:

After finding the interquartile range (iqr), the next task is to calculate the upper and lower bound points that delimit the outlier range in the data. We use the iqr and the calculated q1 and q3 data points to find the lower and upper bound points.

Upper Bound Calculation Information

For the upper bound, the point will be 1.5*iqr above q3, which means (q3 + 1.5*iqr)

Step 1: Compute q3 + 1.5*iqr and store it in a variable upper_bound

In [15]:
upper_bound = q3 + 1.5*iqr

Step 2: Print the upper_bound

In [16]:
print("upper bound",upper_bound)
upper bound 334.0

Lower Bound Calculation Information

For the lower bound, the point will be 1.5*iqr below q1, which means (q1 - 1.5*iqr)

Step 1: Compute q1 - 1.5*iqr and store it in a variable lower_bound

In [18]:
lower_bound = q1 - 1.5*iqr

Step 2: Print the lower_bound

In [19]:
print("lower bound",lower_bound)
lower bound -90.0
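The two bound calculations can also be wrapped in a small helper, which makes it easy to re-check the arithmetic or try a multiplier other than 1.5. This is only a sketch: the function name iqr_bounds and its parameter k are illustrative and not part of pandas.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower_bound, upper_bound) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported above for the price column
lower_bound, upper_bound = iqr_bounds(69.0, 175.0)
print("lower bound", lower_bound)
print("upper bound", upper_bound)
```

With q1 = 69 and q3 = 175, the IQR is 106 and 1.5 × 106 = 159, which gives 69 - 159 = -90 and 175 + 159 = 334, matching the values printed above.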

Clip the Outliers:

Find the Clipping points:

Now that we have found the outlier bounds, we need to clip the price data according to these bounds. The clipping points for the upper and lower bound will be as follows:

  • lower_point = max(lower_bound, nyc['price'].min())
  • upper_point = min(upper_bound, nyc['price'].max())

Lower Point Calculation Information:

For the lower_point, we take the max of the minimum price value (0) and the lower_bound, because we only need to go as far as the minimum data point that we have. In this example, we found that our lower_bound is -90, but our lowest data point is 0. So we don't need -90 as the lower limit; we only need to go as far as the lowest data point, which is 0. That is why we use max() for the lower point calculation.

Step 1: Apply the lower point formula from above and store the result in variable lower_point.

In [9]:
lower_point = max(lower_bound, nyc['price'].min())

Explanation: Call the max() function with the following parameters:

  • lower_bound: the lower bound calculated using the IQR formula
  • nyc['price'].min(): the minimum value from the price data

The result of the max() function is stored in the variable lower_point.

Step 2: Print the variable lower_point

In [20]:
print("lower_point", lower_point)
lower_point 0

Upper Point Calculation Information

For the upper_point, we only need to go as far as the highest datapoint. The maximum datapoint we have is 10,000 and our upper bound is 334, so we take the minimum of these two values.

That is why we use min() for the upper point calculation.

Step 1: Apply the upper point formula from above and store the result in variable upper_point

In [9]:
upper_point = min(upper_bound, nyc['price'].max())

Explanation: Call the min() function with the following parameters:

  • upper_bound: the upper bound calculated using the IQR formula
  • nyc['price'].max(): the maximum value from the price data

The result of the min() function is stored in the variable upper_point.

Step 2: Print the variable upper_point

In [21]:
print("upper_point", upper_point)
upper_point 334.0
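A point worth noting about the max()/min() step: it keeps the reported clipping limits inside the range of the data, but it does not change which values end up being clipped. The sketch below checks this on a few hypothetical prices, using the bounds and the min/max values reported earlier; clip_value is an illustrative helper, not a pandas function.

```python
# Values reported earlier in this notebook
lower_bound, upper_bound = -90.0, 334.0
price_min, price_max = 0, 10000

lower_point = max(lower_bound, price_min)
upper_point = min(upper_bound, price_max)

# Hypothetical sample prices, for illustration only
sample = [0, 40, 106, 175, 500, 10000]

def clip_value(x, low, high):
    """Cap x so that it lies between low and high."""
    return min(max(x, low), high)

with_points = [clip_value(p, lower_point, upper_point) for p in sample]
with_bounds = [clip_value(p, lower_bound, upper_bound) for p in sample]

print(with_points)
print(with_points == with_bounds)
```

Both versions give the same clipped values, since no price lies below 0 and therefore nothing is affected by whether the lower limit is 0 or -90.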

Clip outliers using the clipping points:

Now that we have found our lower and upper points, we clip our data according to them. After clipping, as shown in our introductory example, the points less than the lower_point will be set to the lower_point and the points greater than the upper_point will be set to the upper_point. This replaces the outlier values in the price data with the nearest clipping point. This method of handling outliers is called winsorizing.

Step 1: Apply the clip method on the price data using the following formula

In [9]:
nyc['price'].clip(lower_point, upper_point)

Explanation: Call the clip() method on the price column with the following parameters:

  • lower_point: the lowest point for clipping the price data
  • upper_point: the highest point for clipping the price data

Step 2: Assign the result back to the price column of the nyc dataframe to make the changes permanent

In [22]:
nyc['price'] = nyc['price'].clip(lower_point, upper_point)
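The assignment in Step 2 is needed because clip() returns a new Series and leaves the original untouched. A minimal sketch with a hypothetical Series of prices (not the NYC data) shows the difference:

```python
import pandas as pd

# Hypothetical sample prices, for illustration only
prices = pd.Series([0, 69, 106, 175, 500, 10000])

# clip() returns a new Series; `prices` itself is not modified
clipped = prices.clip(lower=0, upper=334)

print(prices.tolist())    # original values are unchanged
print(clipped.tolist())   # values above 334 are set to 334
```

Only after the result is assigned back, as in Step 2, does the dataframe column hold the clipped values.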

Check final clipped data:

After clipping, we check the five-number summary of the price data using the describe method.

Step 1: Use the describe() method on the price column of nyc

In [9]:
nyc['price'].describe()

Explanation: Select the price column from the dataframe nyc and apply the describe() method to it.

Out[23]:
count    48895.000000
mean       132.979753
std         83.530504
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max        334.000000
Name: price, dtype: float64

After clipping the price data, we can see that the distribution has become more compact: the values now lie in the range 0-334, compared with 0-10000 before. Let us use the strip plot again to observe its spread.
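Comparing the two describe() outputs, the quartiles (69, 106, 175) are unchanged while the mean and standard deviation have dropped. That is expected when clipping at the 1.5*IQR limits: only values beyond the limits move, and they stay on the same side of the quartiles. The sketch below reproduces the effect on a small hypothetical sample using the standard statistics module.

```python
import statistics

# Hypothetical sample values, for illustration only
values = [10, 12, 13, 15, 16, 18, 20, 22, 95]

q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
upper = q3 + 1.5 * (q3 - q1)

# Clip only at the upper limit; no value in this sample is below the lower limit
clipped = [min(v, upper) for v in values]

quartiles_before = statistics.quantiles(values, n=4, method="inclusive")
quartiles_after = statistics.quantiles(clipped, n=4, method="inclusive")

print("quartiles unchanged:", quartiles_before == quartiles_after)
print("mean:", statistics.mean(values), "->", statistics.mean(clipped))
print("std: ", statistics.stdev(values), "->", statistics.stdev(clipped))
```

The same pattern appears in the price data: the mean falls from about 152.7 to about 133.0 and the standard deviation from about 240.2 to about 83.5, while the 25%, 50% and 75% values stay the same.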

Visualize the clipped data

Let us see how our strip plot looks now compared to the initial one we plotted.

In [9]:
price_strip2 = px.strip(nyc, y='price')

Explanation: Using px, call the strip() method with the following parameters:

  • nyc: the variable where the data is stored
  • y='price': the data column/feature to plot on the y axis

The resulting plot is stored in the variable price_strip2.

Step: Display the variable price_strip2 using the show() method

In [6]:
price_strip2.show()

Now let's visualize the boxplot to check if outliers still exist in the price data.

In [9]:
sns.boxplot(data = nyc, x = 'price', width = 0.4)

Explanation: Using sns, call the boxplot() method with the following parameters:

  • data: the source data, nyc in this case
  • x: the feature to plot on the x axis (horizontal view), price in this case
  • width: the width of the box

[Image: box plot of price after clipping]

From the strip plot, we can see that the price points are closely condensed within the range 0-350. There is very little dispersion in the data points.
Similarly, in the boxplot, there are no outlier points (dots) beyond the whisker lines, which indicates that the outlier points have been replaced by clipping/winsorizing.

Conclusion

By using the clip method, we have replaced the outliers in the price data using winsorizing. With the extreme values capped, this dataset is better suited for building models that predict Airbnb prices in New York.
